A Non-linear GPU Thread Map for Triangular Domains

Authors

Cristobal A. Navarro

Benjamin Bustos

Nancy Hitschfeld-Kahler

Abstract

There is a stage in the GPU computing pipeline where a grid of thread-blocks, in parallel space, is mapped onto the problem domain, in data space. Since the parallel space is restricted to a box-type geometry, the mapping approach is typically a k-dimensional bounding box (BB) that covers a p-dimensional data space. Threads that fall inside the domain perform computations, while threads that fall outside are discarded at runtime. In this work we study the case of mapping threads efficiently onto triangular domain problems and propose a block-space linear map λ(ω), based on the properties of the lower triangular matrix, that reduces the number of unnecessary threads from O(n^2) to O(n). Performance results for global memory accesses show an improvement of up to 18% with respect to the bounding-box approach, placing λ(ω) in second place, below the rectangular-box approach and above the recursive-partition and upper-triangular approaches. For shared memory scenarios, λ(ω) was the fastest approach, achieving a 7% performance improvement while preserving thread locality. The results obtained in this work make λ(ω) an interesting map for efficient GPU computing on parallel problems that define a triangular domain, with or without neighborhood interactions. The extension to tetrahedral domains is analyzed, with applications to triplet-interaction n-body problems.
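To make the idea of mapping a linear block index onto a lower-triangular domain concrete, here is a minimal host-side sketch in Python. It is not taken from the paper: it assumes the standard inverse of the triangular numbers as the closed form, and the names `lambda_map`, `blocks_needed` and `bounding_box_waste` are hypothetical. The authors' exact formulation, in particular how the square root is evaluated on the GPU and how the blocks are arranged into a launch grid, may differ.

```python
import math
from typing import Tuple


def lambda_map(w: int) -> Tuple[int, int]:
    """Map a linear block index w >= 0 to (row, col) in the lower triangle.

    Assumed closed form (not quoted from the paper): row i is the largest
    integer with i*(i+1)/2 <= w, and col j is the offset inside that row.
    """
    # Integer square root avoids floating-point rounding for large w.
    i = (math.isqrt(8 * w + 1) - 1) // 2
    j = w - i * (i + 1) // 2
    return i, j


def blocks_needed(n: int) -> int:
    """Number of blocks in an n x n lower triangle, diagonal included."""
    return n * (n + 1) // 2


def bounding_box_waste(n: int) -> int:
    """Blocks an n x n bounding box launches outside the lower triangle."""
    return n * n - blocks_needed(n)
```

Under these assumptions, for n = 3 the indices ω = 0, ..., 5 map to (0,0), (1,0), (1,1), (2,0), (2,1), (2,2), so every block lands inside the triangle, whereas the bounding box would launch 9 blocks and discard 3 of them. In a kernel the same arithmetic would be applied once per block, with threads keeping their usual local offsets inside the block.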

Related resources

Improving the GPU space of computation under triangular domain problems

There is a stage in the GPU computing pipeline where a grid of thread-blocks is mapped to the problem domain. Normally, this grid is a k-dimensional bounding box that covers a k-dimensional problem no matter its shape. Threads that fall inside the problem domain perform computations, otherwise they are discarded at runtime. For problems with non-square geometry, this is not always the best idea...

Non-additive Lie centralizer of infinite strictly upper triangular matrices

Let $\mathcal{F}$ be a field of characteristic zero and $N_{\infty}(\mathcal{F})$ be the algebra of infinite strictly upper triangular matrices with entries in $\mathcal{F}$, and let $f:N_{\infty}(\mathcal{F})\rightarrow N_{\infty}(\mathcal{F})$ be a non-additive Lie centralizer of $N_{\infty}(\mathcal{F})$; that is, a map satisfying $f([X,Y])=[f(X),Y]$ for all $X,Y\in N_{\infty}(\mathcal{F})$...

Block-Space GPU Mapping for Embedded Sierpiński Gasket Fractals

This work studies the problem of GPU thread mapping for a Sierpiński gasket fractal embedded in a discrete Euclidean space of n × n. A block-space map $\lambda : \mathbb{Z}^2_E \mapsto \mathbb{Z}^2_F$ is proposed, from Euclidean parallel space E to embedded fractal space F, that maps in $O(\log_2 \log_2(n))$ time and uses no more than $O(n^H)$ threads, with H ≈ 1.58... being the Hausdorff dimension, making it parallel space efficient....

Improving Inter-thread Data Sharing with GPU Caches

The massive amount of fine-grained parallelism exposed by a GPU program makes it difficult to exploit shared cache benefits even when there is good program locality. The non-deterministic nature of thread execution in the bulk synchronous parallel (BSP) model makes the situation even worse. Most prior work in exploiting GPU cache sharing focuses on regular applications that have linear memory acces...
